Abstract. The problem of searching for a model-based scene interpretation
is analyzed within a probabilistic framework. Object models are
formulated as generative models for range data of the scene. A new statis- 
tical criterion, the truncated object probability, is introduced to infer an 
optimal sequence of object hypotheses to be evaluated for their match to 
the data. The truncated probability is partly determined by prior knowl- 
edge of the objects and partly learned from data. Some experiments on 
sequence quality and object segmentation and recognition from stereo 
data are presented. The article recovers classic concepts from object 
recognition (grouping, geometric hashing, alignment) from the proba- 
bilistic perspective and adds insight into the optimal ordering of object 
hypotheses for evaluation. Moreover, it introduces point-relation densi- 
ties, a key component of the truncated probability, as statistical models 
of local surface shape. 
1 Introduction

Model-based object recognition or, more generally, scene interpretation can be 
conceptualized as a two-part process: one that generates a sequence of hypothe- 
ses on object identities and poses, the other that evaluates them based on the 
object models. Viewed as an optimization problem, the former is concerned with 
the search sequence, the latter with the objective function. Usually the evalua- 
tion of the objective function is computationally expensive. A reasonable search 
algorithm will thus arrive at an acceptable hypothesis within a small number of 
such evaluations. 

In this article, we will analyze the search for a scene interpretation from a 
probabilistic perspective. The object models will be formulated as generative 
models for range data. For visual analysis of natural scenes, that is, scenes that 
are cluttered with multiple, non-completely visible objects in an uncontrolled 
context, it is a highly non-trivial task to optimize the match of a generative 
model to the data. Local optimization techniques will usually get stuck in
meaningless local optima, while techniques akin to exhaustive search are
precluded by time constraints. The critical aspect of many object recognition
problems hence concerns the generation of a clever search sequence.

In the probabilistic framework explored here, the problem of optimizing the 
match of generative object models to data is alleviated by the introduction of 
another statistical match criterion that is more easily optimized, albeit less re- 
liable in its estimates of object parameters. The new criterion is used to define 
a sequence of hypotheses for object parameters of decreasing probability, start- 
ing with the most probable, while the generative model remains the measure for 
their evaluation. For an efficient generation of the search sequence, it is desirable 
to make the new criterion as simple as possible. On the other hand, for obtaining 
a short search sequence for acceptable object parameters, it is necessary to make 
the criterion as informative as is feasible. As a key quantity for sequence opti- 
mization, an estimate of features' posterior probability enters the new criterion. 
The method is an alternative to other, often more heuristic strategies aimed at
producing a short search sequence, such as checking for model features in the
order of the features' prior probability, in a coarse-to-fine hierarchy,
or as predicted from the current scene interpretation.

Classic hypothesize-and-test paradigms have been demonstrated in RANSAC-like
and alignment techniques. The method proposed here is more
similar to the latter in that testing of hypotheses is done with respect to the full 
data rather than on a sparse feature representation. It significantly differs from 
both, however, in that it recommends a certain order of hypotheses to be tested, 
which is indeed the main point of the present study. Other classic concepts 
that are recovered naturally from the probabilistic perspective are feature-based 
indexing and geometric hashing, and feature grouping and perceptual
organization.

To keep the notational load in this article to a minimum and to support ease 
of reading, we will denote all probability densities by the letter p and indicate 
each type of density by its arguments. Moreover, we will not introduce symbols 
for random variables but only for the values they take. It is understood that 
probability densities become probabilities whenever these values are discrete. 



2 Generative Object Models for Range Data 

Consider the task of recognizing and localizing a number of objects in a scene.
More precisely, we want to estimate object parameters $(c, \mathbf{p})$ from data, where
$c \in \mathbb{N}$ is the object's discrete class label and $\mathbf{p} \in \mathbb{R}^6$ are its pose parameters
(rotation and translation). Suppose we use a sensor like a set of stereo cameras
or a laser scanner to obtain range-data points $\mathbf{d} \in \mathbb{R}^3$ of one view of the scene,
that is, 2½-dimensional (2½D) range data. A reasonable generative
model of the range-data points within the volume $V(c, \mathbf{p})$ of the object $c$ with
pose $\mathbf{p}$ is given by the conditional probability density

\[
p(\mathbf{d} \mid c, \mathbf{p}) =
\begin{cases}
N(c)\, f[\phi(\mathbf{d}; c, \mathbf{p})] & \text{for } \mathbf{d} \in S(c, \mathbf{p}), \\
N(c)\, b & \text{for } \mathbf{d} \in V(c, \mathbf{p}) \setminus S(c, \mathbf{p}).
\end{cases}
\tag{1}
\]
The condition is on the object parameters $(c, \mathbf{p})$. The set $S(c, \mathbf{p}) \subset V(c, \mathbf{p})$ is the
region close to visible surfaces of the object $c$ with pose $\mathbf{p}$, $b > 0$ is a constant that
describes the background of spurious data points, and $N(c)$ is a normalization
constant. The surface density of points is described by a function $f$ of the angle
$\phi(\mathbf{d}; c, \mathbf{p})$ between the direction of gaze of the sensor and the object's inward
surface normal at the (close) surface point $\mathbf{d} \in S(c, \mathbf{p})$. The function $f$ takes its
maximal value, which we may set to 1, at $\phi = 0$, i.e., on surfaces orthogonal to
the sensor's gazing direction.

Four comments are in order. First, the normalization constant $N(c)$ generally
depends upon both $c$ and $\mathbf{p}$. However, we here neglect its dependence on the
object pose $\mathbf{p}$. The consequence is that objects will be harder to detect if they
expose less surface area to the sensor. Second, the extension of the set $S(c, \mathbf{p})$
has to be adapted to the level of noise in the range data. Third, points
$\mathbf{d} \in S(c, \mathbf{p})$ that are close to but not on the surface are assigned the surface normals of
close surface points. Fourth, given the object parameters $(c, \mathbf{p})$, the data points
$\mathbf{d} \in V(c, \mathbf{p})$ can be assumed statistically independent with reasonable accuracy.

Our task is to estimate the parameters $c$ and $\mathbf{p}$ by optimizing the match
of the generative model (1) to the range data $D$. Because of the conditional
independence of the data points in $V(c, \mathbf{p})$, this means maximizing the
logarithmic likelihood

\[
L(c, \mathbf{p}; D) := \sum_{\mathbf{d} \in D \cap V(c, \mathbf{p})} L(c, \mathbf{p}; \mathbf{d})
:= \sum_{\mathbf{d} \in D \cap V(c, \mathbf{p})} \ln p(\mathbf{d} \mid c, \mathbf{p})
\tag{2}
\]

with respect to $(c, \mathbf{p}) \in \Omega \subset \mathbb{N} \times \mathbb{R}^6$. A computationally efficient version is
obtained by assuming that $\phi \ll 1$, which is true for patches of surface that are
approximately orthogonal to the direction of gaze. Fortunately, such parts of the
surface contribute most data points; cf. (1). Observing that

\[
\ln f(\phi) = -a \phi^2 + O(\phi^4) = 2a (\cos\phi - 1) + O(\phi^4) ,
\tag{3}
\]

with some constant $a > 0$, we may neglect terms of order $O(\phi^4)$ to obtain

\[
L(c, \mathbf{p}; \mathbf{d}) \approx
\begin{cases}
2a\,[\mathbf{n}(\mathbf{d}; c, \mathbf{p}) \cdot \mathbf{g} - 1] + \ln N(c) & \text{for } \mathbf{d} \in S(c, \mathbf{p}), \\
\ln b + \ln N(c) & \text{for } \mathbf{d} \in V(c, \mathbf{p}) \setminus S(c, \mathbf{p}),
\end{cases}
\tag{4}
\]

where $\mathbf{n}(\mathbf{d}; c, \mathbf{p})$ is the object's inward surface-normal vector at the (close) surface
point $\mathbf{d} \in S(c, \mathbf{p})$ and $\mathbf{g}$ is the sensor's direction of gaze; both are unit vectors.
The constant $\ln b < 0$ is effectively a penalty term for data points that come
to lie in the object's interior under the hypothesis $(c, \mathbf{p})$. Again, discarding the
terms $O(\phi^4)$ in (3) is acceptable as data points producing a large error will be
rare; cf. (1).
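
As an illustration of how the approximate log-likelihood of (2) and (4) could be evaluated for one hypothesis, the following is a minimal sketch and not the authors' implementation. It assumes that the data points falling into the object volume have already been collected, that each comes with the inward unit normal of its nearest model surface point, and that a boolean mask marks membership in the near-surface region. The function name, the argument names, and the mask are hypothetical.

```python
import numpy as np

def log_likelihood(normals, near_surface, g, a, b, N):
    """Approximate log-likelihood of one (c, p) hypothesis, Eqs. (2) and (4).

    normals      : (n, 3) inward unit surface normals assigned to the n data
                   points inside the object volume (hypothetical input).
    near_surface : (n,) booleans, True for points in the near-surface region S.
    g            : (3,) unit vector, the sensor's direction of gaze.
    a, b, N      : constants of the model; b < 1 penalizes interior points.
    """
    normals = np.asarray(normals, dtype=float)
    near_surface = np.asarray(near_surface, dtype=bool)
    g = np.asarray(g, dtype=float)

    # Near-surface points: 2a (n . g - 1), the small-angle form of ln f(phi).
    surface_term = 2.0 * a * (normals @ g - 1.0)
    # Interior points: constant penalty ln b.
    per_point = np.where(near_surface, surface_term, np.log(b)) + np.log(N)
    return float(per_point.sum())
```

Points on surfaces facing the sensor contribute zero from the first term, while every point placed in the object's interior lowers the score by a fixed amount.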

3 Probabilistic Search for Model Matches 

In this section, we derive the probabilistic search algorithm. We first intro- 
duce another statistical criterion function of object parameters (c, p). The new 
criterion, to be called truncated probability (TP), defines a sequence of
$(c, \mathbf{p})$-hypotheses. The hypotheses are evaluated by the likelihood (2), starting with
the most probable $(c, \mathbf{p})$-value and proceeding to gradually less probable values,
as estimated by the TP. The search terminates as soon as $L(c, \mathbf{p}; D)$ is large
enough.

We will obtain the TP from a series of steps that simplify from the ideal 
criterion, the posterior probability of object parameters (c, p). Although this 
derivation cannot be regarded as a thoroughly controlled approximation, it will 
make explicit the underlying assumptions and give some feeling for how far we 
have to depart from the ideal to arrive at a feasible criterion. 

The TP makes use of the probability of the presence of feature values consis- 
tent with object parameters (c, p). It is thus an essential property of the approach 
to treat features as random variables. 

3.1 The Search Sequence 

Let us define point features as surface points centered on certain local, isolated
surface shapes. Examples of such point features are corners, saddle points, points
of locally extreme surface curvature, etc. They are characterized by a pair $(s, \mathbf{f})$,
where $s \in \mathbb{N}$ is the feature's shape-class label and $\mathbf{f} \in \mathbb{R}^3$ is its location. Consider
now a random variable that takes values $(s, \mathbf{f})$ of point features that are related
to the objects sought in the data. Its values are statistically dependent upon the
range data $D$. Let us restrict the possible feature values $(s, \mathbf{f})$ to $s \in \{1, 2, \ldots, m\}$
and $\mathbf{f} \in D$, such that only data points can be feature locations. This is equivalent
to setting the feature-value probability to 0 for values $(s, \mathbf{f})$ with $\mathbf{f} \notin D$. The
restriction is not strictly correct, as we will lose true feature locations between
data points. However, it will greatly facilitate the search by limiting features
to discrete values that lie close to the true object surfaces and, hence, include
highly probable candidates.

Let us introduce the concept of groups of point features. A feature group is
a set $G_g = \{(s_1, \mathbf{f}_1), (s_2, \mathbf{f}_2), \ldots, (s_g, \mathbf{f}_g)\}$ of feature values with $\mathbf{f}_i \neq \mathbf{f}_j$ for $i \neq j$.
A group $G_g$ hence contains simultaneously possible values of $g$ features.

The best knowledge we could have about true values of the object parameters
$(c, \mathbf{p})$ is encapsulated in their posterior-probability density given the data $D$,
$p(c, \mathbf{p} \mid D)$. However, we usually do not know this density, and if we knew it, its
maximization would pose a problem similar to our initial one of maximizing the
generative model (1). We can nonetheless expand it using feature groups,

\[
p(c, \mathbf{p} \mid D) = \sum_{G_g} p(c, \mathbf{p} \mid D, G_g)\, p(G_g \mid D) , \qquad
\sum_{G_g} p(G_g \mid D) = 1 .
\tag{5}
\]

The summations are over all possible groups $G_g$ of fixed size $g$. Their enumeration
is straightforward but tedious to explicate, so we omit this here. Note that the
expansion (5) is only valid if we can be sure to find $g$ true feature locations
among the data points.
Let us now simplify the density (5) to the density

\[
p_g(c, \mathbf{p}; D) := \sum_{G_g} p(c, \mathbf{p} \mid G_g)\, p(G_g \mid D) .
\tag{6}
\]

Unlike the posterior density (5), $p_g$ depends upon the feature-group size $g$.
Formally, $p_g$ is a Bayesian belief network with a $g$-feature probability distribution
at the intermediate node. Maximization of $p_g$ with respect to $(c, \mathbf{p})$ is still too
difficult a task, as all possible feature-group values contribute to all values of
the $(c, \mathbf{p})$-density.

A radically simplifying step thus takes us naturally to the density

\[
q_g(c, \mathbf{p}; D) := \max_{G_g}\, p(c, \mathbf{p} \mid G_g)\, p(G_g \mid D) ,
\tag{7}
\]

where for each $(c, \mathbf{p})$ only the largest term in the sum of (6) contributes. Note
that $q_g(c, \mathbf{p}; D) \le p_g(c, \mathbf{p}; D)$ for all $(c, \mathbf{p})$, such that $q_g$ is not normalized on the
set $\Omega$ of object parameters $(c, \mathbf{p})$. This density will nonetheless be useful for our
purpose of guiding a search through $\Omega$, as there only relative densities matter.

As pointed out above, for all this to make sense, we need $g$ true feature
locations among the data points. To be safe, one could be tempted to set $g = 1$.
However, the simplifying step from density $p_g$ to density $q_g$ suggests that the
latter will be more informative as to the true value of $(c, \mathbf{p})$ if the sum in (6)
is dominated for high-density values of $(c, \mathbf{p})$ by only a few terms. Now, fewer
than three point features do not generally define a finite set of consistent object
parameters $(c, \mathbf{p})$. The density $p(c, \mathbf{p} \mid G_g)$ will hence not exhibit a pronounced
maximum for $g < 3$. High-density points of $p_g$ then arise from accumulation over
many $g$-feature values, i.e., terms in (6), and are necessarily lost in $q_g$. Groups
of size $g \ge 3$, on the other hand, do define finite sets of consistent $(c, \mathbf{p})$-values,
and $p(c, \mathbf{p} \mid G_g)$ has infinite density there. Altogether, feature groups of size $g = 3$
seem to be a good choice for our search procedure; see, however, the discussion
below.

Let us now introduce the logarithmic truncated probability (TP),

\[
\mathrm{TP}(c, \mathbf{p}; D) := \lim_{\epsilon \to 0} \ln \int_{S_\epsilon(\mathbf{p})} d\mathbf{p}'\, q_3(c, \mathbf{p}'; D) ,
\tag{8}
\]

where $S_\epsilon(\mathbf{p})$ is a sphere in pose space centered on $\mathbf{p}$ with radius $\epsilon > 0$. The
integral and limit are needed to pick up infinities in the density $q_3(c, \mathbf{p}; D)$; see
below. The TP is truncated in a double sense: finite contributions from $q_3$ are
truncated and $q_3$ itself is obtained from truncating the sum in (6).

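According to the discussion above, it is expected that $(c, \mathbf{p})$-values of high
likelihood (2) will mostly yield a high TP (8). Our search thus proceeds by
evaluating, in that order,

\[
L(c_1, \mathbf{p}_1; D),\ L(c_2, \mathbf{p}_2; D),\ \ldots \quad \text{with} \quad
\mathrm{TP}(c_1, \mathbf{p}_1; D) \ge \mathrm{TP}(c_2, \mathbf{p}_2; D) \ge \ldots , \quad
(c_1, \mathbf{p}_1) = \arg\max_{(c, \mathbf{p})} \mathrm{TP}(c, \mathbf{p}; D) .
\tag{9}
\]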
The search stops as soon as $L(c_k, \mathbf{p}_k; D) > \Theta$ for the threshold $\Theta$, or all $(c, \mathbf{p})$-candidates have
been evaluated. In the former case, the object identity $c_k$ and object pose $\mathbf{p}_k$
are returned as the estimates. In the latter case, it is inferred that none of the
objects sought is present in the scene. Alternatively, if we know a priori that one
of the objects must be present, we may pick the object parameters that have
scored highest under $L$ in the whole sequence. The algorithm will be formulated
in more detail in Sect. 3.3.

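In the density (7) that guides the search, one of the factors is the density of
the object parameters conditioned on the values $G_3$ of a triple of point features.
This is explicitly

\[
p(c, \mathbf{p} \mid G_3) =
\begin{cases}
\displaystyle \sum_{i=1}^{h(G_3)} \gamma_i(G_3)\, \delta_{c, C_i(G_3)}\, \delta(\mathbf{p} - \mathbf{P}_i(G_3))
+ \Bigl[ 1 - \sum_{i=1}^{h(G_3)} \gamma_i(G_3) \Bigr]\, p_1(c, \mathbf{p}; G_3) & \text{for } G_3 \in \mathcal{C}, \\[2ex]
p_2(c, \mathbf{p}; G_3) & \text{for } G_3 \notin \mathcal{C}.
\end{cases}
\tag{10}
\]

Here $\delta_{c,c'}$ is the Kronecker delta (elements of the unit matrix) and $\delta(\mathbf{p} - \mathbf{p}')$ is
the Dirac-delta distribution on pose space; $\mathcal{C}$ is the set of feature-triple values
consistent with any object hypotheses; $h(G_3)$ is the number of object hypotheses
consistent with the feature triple $G_3$; $(C_i(G_3), \mathbf{P}_i(G_3))$ are the consistent
$(c, \mathbf{p})$-hypotheses; $\gamma_i(G_3) \in (0, 1)$ are probability-weighting factors for the hypotheses.
Generally we have

\[
\sum_{i=1}^{h(G_3)} \gamma_i(G_3) < 1 ,
\tag{11}
\]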

which leaves a non-vanishing probability $1 - \sum_i \gamma_i(G_3) > 0$ that three consistent
features do not all belong to one of the objects sought. Accordingly, the density
$p(c, \mathbf{p} \mid G_3)$ on $\Omega$ contains some finite, spread contribution $\propto p_1(c, \mathbf{p}; G_3)$, a
"background" density, in the consistent-features case. In the inconsistent-features case,
the density is similarly spread on $\Omega$ and given by some finite term $p_2(c, \mathbf{p}; G_3)$.
Clearly, only the values $(C_i, \mathbf{P}_i)$ of object parameters where the density (10) is
infinite will be visited during the search (9); cf. (8).

The functions $\gamma_i(G_3)$, $C_i(G_3)$, $\mathbf{P}_i(G_3)$ are computed by generalized geometric
hashing. Let $G_3 = \{(s_1, \mathbf{f}_1), (s_2, \mathbf{f}_2), (s_3, \mathbf{f}_3)\}$. As key into the hash table we use
the pose invariants

\[
\bigl( s_1, s_2, s_3, \|\mathbf{f}_1 - \mathbf{f}_2\|, \|\mathbf{f}_2 - \mathbf{f}_3\|, \|\mathbf{f}_3 - \mathbf{f}_1\| \bigr) ,
\tag{12}
\]

appropriately shifted, scaled, and quantized. Here $\|\cdot\|$ denotes the Euclidean
vector norm. The poses $\mathbf{P}_i(G_3)$, $i = 1, 2, \ldots, h(G_3)$, are computed from matching
the triple $G_3$ to $h(G_3)$ model triples from the hash table. Small model distortions
that result from an error-tolerant match are removed by orthogonalization of the
match-related pose transforms. The weight $\gamma_i(G_3)$ and the object class $C_i(G_3)$
are drawn along with each matched model triple.
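
To make the role of the key (12) concrete, here is a small sketch of a hash table indexed by quantized pose invariants. It is an illustration under stated assumptions and not the generalized geometric hashing of the article: the bin width, the function names, and the stored tuple layout are hypothetical, the features are taken in the order given (the text does not say how the order within a triple is made canonical), and no error-tolerant lookup of neighboring bins is attempted.

```python
from collections import defaultdict

import numpy as np

def triple_key(triple, bin_width=2.0):
    """Quantized pose-invariant key of a feature triple, after Eq. (12).

    triple    : three (s, f) pairs, s an integer shape label, f a 3D location.
    bin_width : quantization step for the distances (hypothetical value).
    """
    (s1, f1), (s2, f2), (s3, f3) = triple
    f1, f2, f3 = (np.asarray(f, dtype=float) for f in (f1, f2, f3))
    dists = (np.linalg.norm(f1 - f2),
             np.linalg.norm(f2 - f3),
             np.linalg.norm(f3 - f1))
    # Shape labels stay as they are; distances are reduced to bin indices.
    return (s1, s2, s3) + tuple(int(d // bin_width) for d in dists)

def build_table(model_entries, bin_width=2.0):
    """Store (object class, model triple, weight) under each model triple's key."""
    table = defaultdict(list)
    for triple, obj_class, weight in model_entries:
        table[triple_key(triple, bin_width)].append((obj_class, triple, weight))
    return table
```

Because the key uses only shape labels and inter-feature distances, a scene triple that is a rigidly moved copy of a model triple maps to the same table entry, from which the class and weight can be read and the pose computed by matching the two triples.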

The weights $\gamma_i(G_3)$ can be adapted during a training period and during
operation of the system in an unsupervised manner. One simply needs to count
the number of instances where a feature triple $G_3$ draws a correct set of object
parameters $(C_i, \mathbf{P}_i)$, i.e., one with $L(C_i, \mathbf{P}_i; D) > \Theta$. The adapted $\gamma_i(G_3)$
represent empirical grouping laws for the point features. Since it will often be the case
that $\gamma_i(G_3) \propto 1/h(G_3)$ for all $i = 1, 2, \ldots, h(G_3)$, it turns out that hypotheses
drawn by feature triples less common among the sought objects tend to be
visited before the ones drawn by more common triples during the search.

In order to fully specify the TP (8) and formulate our search algorithm, it
remains to establish a model for the feature-value probabilities $p(G_3 \mid D)$ given
the data $D$; cf. (7). This is the subject of the next section.



3.2 Inferring Local Surface Shape from Point-Relation Densities 

In principle, a generative model of range data on local surface shapes $s$ can be
built analogous to (1). Thus, we may construct a conditional probability density
$p(D \mid s, \mathbf{f}, \mathbf{r})$, where $\mathbf{r} \in \mathbb{R}^3$ are the rotational parameters of the local surface
patch represented by the point feature $(s, \mathbf{f})$. During recognition, however, it
would then be necessary to optimize the parameters $\mathbf{r}$ for each $s \in \{1, 2, \ldots, m\}$
and $\mathbf{f} \in D$, in order to obtain the likelihood of each feature value $(s, \mathbf{f})$. Besides
the huge overhead of doing so without being interested in an estimate for $\mathbf{r}$,
this procedure would confront us with a problem similar to our original one of
optimizing $p(D \mid c, \mathbf{p})$. On the other hand, the density $p(D \mid s, \mathbf{f})$ is an infeasible
model because of high-order correlations between the data points $D$. Neglecting
these correlations would throw out all information on surface shape $s$.

The strategy we pursue here is to capture some of the informative
correlations between data points $D$ in a family of new representations
$T(\mathbf{c}) = \{\Delta_1, \Delta_2, \ldots, \Delta_N\}$, $\mathbf{c} \in \mathbb{R}^3$. Each $T(\mathbf{c})$ represents the geometry of the data $D$
relative to the point $\mathbf{c}$, and each $\Delta_i$ represents geometric relations between
multiple data points. A reasonable statistical model is then obtained by neglecting
correlations between the $\Delta_i$.



Learning Point-Relation Densities. The point relations we are going to
exploit here are between four data points, that is, tetrahedron geometries, where
three of the points are selected from a spherical neighborhood of the fourth. For
four points $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{c} \in \mathbb{R}^3$ we define the map

\[
\Delta(\mathbf{c}; \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3) :=
\begin{pmatrix}
r / R \\
d / r \\
4a \big/ \bigl[ 3\sqrt{3}\,(r^2 - d^2) \bigr]
\end{pmatrix}
\in [0, 1] \times [-1, 1] \times [0, 1] ,
\tag{13}
\]

where $r$ is the mean distance of the center $\mathbf{c}$ to $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$, $R$ is the radius of the
spherical neighborhood of $\mathbf{c}$, $d$ is the (signed) distance of $\mathbf{c}$ to the plane defined
by $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$, and $a$ is the area of the triangle they define.¹ Explicitly,

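\[
r = \frac{1}{3} \sum_{i=1}^{3} \|\mathbf{x}_i - \mathbf{c}\| ,
\tag{14}
\]

\[
d = \mathrm{sgn}[\mathbf{g} \cdot (\mathbf{x}_1 - \mathbf{c})]\,
\frac{\bigl| (\mathbf{x}_1 - \mathbf{c}) \cdot [(\mathbf{x}_2 - \mathbf{x}_1) \times (\mathbf{x}_3 - \mathbf{x}_1)] \bigr|}
{\|(\mathbf{x}_2 - \mathbf{x}_1) \times (\mathbf{x}_3 - \mathbf{x}_1)\|} ,
\tag{15}
\]

\[
a = \frac{1}{2}\, \|(\mathbf{x}_2 - \mathbf{x}_1) \times (\mathbf{x}_3 - \mathbf{x}_1)\| .
\tag{16}
\]

As seen in (15), the sign of $d$ is determined by the direction of gaze $\mathbf{g}$. We can
now define the family of tetrahedron representations $T(\mathbf{c})$ by

\[
T(\mathbf{c}) := \Bigl\{ \Delta(\mathbf{c}; \mathbf{d}_1, \mathbf{d}_2, \mathbf{d}_3) \ \Big|\
\{\mathbf{d}_1, \mathbf{d}_2, \mathbf{d}_3\} \subset D
\ \wedge\ \max_{i=1,2,3} \|\mathbf{d}_i - \mathbf{c}\| < R
\ \wedge\ \max_{i=1,2,3} \|\mathbf{d}_i - \mathbf{c}\| - \min_{i=1,2,3} \|\mathbf{d}_i - \mathbf{c}\| < \epsilon \Bigr\} .
\tag{17}
\]

The parameter $\epsilon > 0$ sets a small tolerance for differences in distance to $\mathbf{c}$ among
each data-point triple $\mathbf{d}_1, \mathbf{d}_2, \mathbf{d}_3$, that is, ideally all three points have the same
distance. In practice, the tolerance $\epsilon$ has to be adapted to the density of range-data
points $D$ to obtain enough $\Delta$-samples in each $T(\mathbf{c})$. Figure 1 depicts the
geometry of the tetrahedron representations.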



Fig. 1. Transformation of range data points to a tetrahedron
representation. For an inspection point $\mathbf{c}$, the 3-tuples $(r, d, a)$
are collected and normalized [cf. (13)] for each triple of data
points $\mathbf{d}_1, \mathbf{d}_2, \mathbf{d}_3 \in D$ with (approximately) equal distance $r \in (0, R)$
from $\mathbf{c}$.
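
The quantities $r$, $d$, and $a$ can be computed directly from four points, as the following sketch shows. It is meant only as an illustration: the function name is hypothetical, and the normalization of the third component (triangle area divided by the area $\frac{3\sqrt{3}}{4}(r^2 - d^2)$ of the largest triangle inscribed in a circle of radius $\sqrt{r^2 - d^2}$) is an assumption chosen so that the result lies in $[0, 1]$ for three points equidistant from the center.

```python
import numpy as np

def tetra_relation(c, x1, x2, x3, g, R):
    """Normalized tetrahedron relation (r/R, d/r, normalized area) of one triple.

    c          : inspection point (3,).
    x1, x2, x3 : three data points in the R-neighborhood of c.
    g          : unit vector, the sensor's direction of gaze.
    R          : radius of the spherical neighborhood.
    """
    c, x1, x2, x3, g = (np.asarray(v, dtype=float) for v in (c, x1, x2, x3, g))

    # Mean distance of the center to the three points.
    r = np.mean([np.linalg.norm(x - c) for x in (x1, x2, x3)])

    # Normal of the plane through the three points (not normalized).
    normal = np.cross(x2 - x1, x3 - x1)
    norm = np.linalg.norm(normal)
    if norm == 0.0:
        raise ValueError("the three points are collinear")

    # Signed distance of c to the plane; the sign follows the gaze direction.
    d = np.sign(g @ (x1 - c)) * abs((x1 - c) @ normal) / norm
    # Area of the triangle spanned by the three points.
    a = 0.5 * norm

    # Assumed normalization of the area; requires r > |d|.
    area_max = 0.75 * np.sqrt(3.0) * (r**2 - d**2)
    return np.array([r / R, d / r, a / area_max])
```

For an equilateral triangle whose corners are all at the same distance from the center, the third component equals 1, which is the largest value it can take under this normalization.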



One of the nice properties of the tetrahedron representation (17) is that for
any surface of revolution we get at its symmetry point $\mathbf{c}$ samples $\Delta \in T(\mathbf{c})$ that
are confined to a surface in $\Delta$-space characteristic of the original surface shape.
For surfaces of revolution, the distinctiveness and low entropy of the distribution
of range data $D$ sampled from a surface in 3D is thus completely preserved in the
tetrahedron representation. For more general surfaces, it is therefore expected
that a high mutual information between $\Delta$-samples and shape classes can be
achieved.

Our goal is to estimate point-relation densities for each shape class
$s = 1, 2, \ldots, m$. Samples are taken from tetrahedron representations centered on
feature locations $\mathbf{f}$ from a training set $\mathcal{F}_s$, that is, from $\bigcup_{\mathbf{f} \in \mathcal{F}_s} T(\mathbf{f})$. Moreover,
we add a non-feature class $s = 0$ that lumps together all the shapes we are
not interested in. The result are the shape-conditioned densities $p(\Delta \mid s)$ for
$\Delta \in [0, 1] \times [-1, 1] \times [0, 1]$ and $s = 0, 1, 2, \ldots, m$. In particular, $p(\Delta \mid s) = p(\Delta \mid s, \mathbf{f})$ for
$\Delta \in T(\mathbf{f})$.

1 The quantity introduced in (13) will be denoted by $\Delta(\mathbf{c}; \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3)$, $\Delta(\mathbf{c})$, or simply
$\Delta$ to indicate its dependence on points as needed.

Estimation of $p(\Delta \mid s)$ is made simple by the fact that we get $O(n^3)$, i.e., a lot
of samples of $\Delta$ for each feature location $\mathbf{f}$ with $n$ data points within a distance
$R$. For our application, it is sufficient to count $\Delta$-samples in bins to estimate
probabilities $p(\Delta \mid s)$ for discrete $\Delta$-events as normalized frequencies.
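
Counting samples in bins can be sketched with a three-dimensional histogram over $[0, 1] \times [-1, 1] \times [0, 1]$. The default grid of 15 x 20 x 10 bins is the one reported for the experiments in Sect. 4.1; the function name and the small probability floor for empty bins are assumptions added here so that logarithms can be taken later, and with the floor the returned values no longer sum exactly to one.

```python
import numpy as np

# Range of the three components of a point-relation sample.
DELTA_RANGE = ((0.0, 1.0), (-1.0, 1.0), (0.0, 1.0))

def estimate_density(samples, bins=(15, 20, 10), floor=1e-9):
    """Estimate the shape-conditioned probabilities as normalized bin counts.

    samples : (n, 3) array of point-relation samples of one shape class.
    bins    : number of bins per dimension.
    floor   : hypothetical lower bound for empty bins, to keep logs finite.
    """
    samples = np.asarray(samples, dtype=float)
    counts, _ = np.histogramdd(samples, bins=bins, range=DELTA_RANGE)
    probs = counts / counts.sum()
    return np.maximum(probs, floor)
```

One such table would be estimated per shape class, including the non-feature class, from the samples collected at the corresponding training locations.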



Inferring Local Surface Shape. Let the features' prior probabilities be $p(s)$
for $s = 1, 2, \ldots, m$, that is, $p(s) \in (0, 1)$ is the probability that any given data
point from $D$ is a feature location of type $s$. The feature priors are virtually
impossible to know for general scenes. We do know, however, that $p(s) \ll 1$ for
$s = 1, 2, \ldots, m$, i.e., the overwhelming majority of data points are not feature
locations. This must be so for a useful dictionary of point features. It thus makes
sense to expand the logarithm of the shapes' posterior probabilities, given $l$
samples $\Delta_1(\mathbf{f}), \Delta_2(\mathbf{f}), \ldots, \Delta_l(\mathbf{f}) \in T(\mathbf{f})$, $\mathbf{f} \in D$,

\[
\ln p[s \mid \Delta_1(\mathbf{f}), \Delta_2(\mathbf{f}), \ldots, \Delta_l(\mathbf{f})]
= \ln p(s) + \sum_{i=1}^{l} \bigl\{ \ln p[\Delta_i(\mathbf{f}) \mid s] - \ln p[\Delta_i(\mathbf{f}) \mid 0] \bigr\}
+ O[p(1), p(2), \ldots, p(m)] ,
\tag{18}
\]

and neglect the terms $O[p(1), p(2), \ldots, p(m)]$. Remember that $p(\Delta \mid 0)$ is the
$\Delta$-density of the non-feature class. The expression (18) neglects correlations
between the $\Delta_i(\mathbf{f}) \in T(\mathbf{f})$.

The sample length $l$ for each representation $T(\mathbf{f})$ will usually be $l \ll |T(\mathbf{f})|$,
the cardinality of the set $T(\mathbf{f})$. It is in fact crucial that we do not have to generate
the complete representations $T(\mathbf{f})$ for recognition, as this would require $O(n^3)$
time for $n$ data points within a distance $R$ from $\mathbf{f}$. Instead, it is possible to draw
a random, unbiased sub-sample of $T(\mathbf{f})$ in $O(n)$ and $O(l)$ time.

The first term in (18) can be split into

\[
\ln p(s) = \ln q(s) + \ln \sum_{s'=1}^{m} p(s') ,
\tag{19}
\]

where $q(s) \in (0, 1)$ are the relative frequencies of shape classes $s = 1, 2, \ldots, m$.
Now, these frequencies do not differ between features over many orders of
magnitude for a useful dictionary of point features.² The dependence of (19) on the
shape class $s$ is thus negligible compared to the sum in (18) for reasonable sample
lengths $l \gg 1$ (several tens to hundreds).

2 A feature that is more than a hundred times less frequent than the others should
usually be dropped for computational efficiency.

The first term in (18) may hence be disregarded, and we are left with the feature's
log-probability

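\[
\Phi(s, \mathbf{f}; D) := \ln p[s \mid \Delta_1(\mathbf{f}), \Delta_2(\mathbf{f}), \ldots, \Delta_l(\mathbf{f})] + \text{const.}
\approx \sum_{i=1}^{l} \bigl\{ \ln p[\Delta_i(\mathbf{f}) \mid s] - \ln p[\Delta_i(\mathbf{f}) \mid 0] \bigr\} ,
\tag{20}
\]

that is, a log-likelihood ratio.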
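
Given binned probability tables for a feature class and for the non-feature class, the log-likelihood ratio (20) reduces to a table lookup and a sum. The sketch below assumes such tables are available as three-dimensional arrays of strictly positive values over $[0, 1] \times [-1, 1] \times [0, 1]$; the function names and the bin-index computation are hypothetical.

```python
import numpy as np

LOW = np.array([0.0, -1.0, 0.0])    # lower bounds of the sample components
HIGH = np.array([1.0, 1.0, 1.0])    # upper bounds of the sample components

def bin_indices(samples, shape):
    """Map (l, 3) samples to integer bin indices of a table with given shape."""
    shape = np.asarray(shape)
    frac = (np.asarray(samples, dtype=float) - LOW) / (HIGH - LOW)
    idx = np.floor(frac * shape).astype(int)
    # Samples on the upper boundary fall into the last bin.
    return np.clip(idx, 0, shape - 1)

def log_probability_score(samples, p_s, p_0):
    """Log-likelihood ratio of Eq. (20) for one candidate feature value.

    samples : (l, 3) point-relation samples drawn around the location f.
    p_s     : binned probabilities for shape class s (all entries > 0).
    p_0     : binned probabilities for the non-feature class (all entries > 0).
    """
    i, j, k = bin_indices(samples, p_s.shape).T
    return float(np.sum(np.log(p_s[i, j, k]) - np.log(p_0[i, j, k])))
```

A positive score means the samples are better explained by the feature class than by the non-feature class, and the score grows with the number of samples that agree.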
3.3 The Search Algorithm 

We still need to determine the TP (8) for guiding the search (9). Let again
$G_3 = \{(s_1, \mathbf{f}_1), (s_2, \mathbf{f}_2), (s_3, \mathbf{f}_3)\}$. It is reasonable to assume statistical independence of
the point features in $G_3$, as long as there are many different objects in the
world that have these features.³ We hence model the feature-triples' posterior
probability $p(G_3 \mid D)$ in (7) as

\[
p(G_3 \mid D) = \prod_{i=1}^{3} p[s_i \mid \Delta_1(\mathbf{f}_i), \Delta_2(\mathbf{f}_i), \ldots, \Delta_l(\mathbf{f}_i)]
\propto \exp \sum_{i=1}^{3} \Phi(s_i, \mathbf{f}_i; D) .
\tag{21}
\]

The TP can then be expressed as

\[
\mathrm{TP}(c, \mathbf{p}; D) = \max_{G_3 \in \mathcal{C}} \Bigl\{
\ln \sum_{i=1}^{h(G_3)} \gamma_i(G_3)\, \delta_{c, C_i(G_3)}\, \delta_{\mathbf{p}, \mathbf{P}_i(G_3)}
+ \sum_{i=1}^{3} \Phi(s_i, \mathbf{f}_i; D) \Bigr\}
\tag{22}
\]

up to additive constants. The first term under the max-operation is the
contribution from feature grouping,⁴ the second from single features. In particular, we
get

\[
\mathrm{TP}(c, \mathbf{p}; D) = -\infty \quad \text{for} \quad
(c, \mathbf{p}) \notin \bigl\{ (C_i(G_3), \mathbf{P}_i(G_3)) \ \big|\ G_3 \in \mathcal{C},\ i \in \{1, 2, \ldots, h(G_3)\} \bigr\} ,
\tag{23}
\]

that is, for $(c, \mathbf{p})$-values not suggested by any feature triple $G_3$.

3 These objects may or may not be in our modeled set.

4 The factor $\delta_{\mathbf{p}, \mathbf{P}_i(G_3)}$ is an idealization. Since the pose parameters $\mathbf{p}$ are continuous,
the hashing with the key (12) is necessarily error tolerant. The factor is then to be
replaced by an integral $\int_{\mathcal{P}_i(G_3)} d\mathbf{p}'\, \delta(\mathbf{p} - \mathbf{p}')$ for a set $\mathcal{P}_i(G_3)$ of pose parameters.

We are now prepared to formulate the search algorithm. The following is not
meant to specify a flow of processing, but rather a logical sequence of operations.

1. From all possible feature values $(s, \mathbf{f}) \in \{1, 2, \ldots, m\} \times D$, select as feature
candidates the values with log-probability $\Phi(s, \mathbf{f}; D) > E$.

2. From the feature candidates, generate all triples $G_3$. Use their keys (12) into
a hash table to find the associated object hypotheses $(C_i(G_3), \mathbf{P}_i(G_3))$ and
their weights $\gamma_i(G_3)$ for $i = 1, 2, \ldots, h(G_3)$.

3. For the object hypotheses obtained, compute the TPs.

4. Evaluate the object hypotheses by the likelihood $L$ in the order of decreasing
TP, starting with the hypothesis scoring highest under TP. Skip duplicate
hypotheses. Stop when $L(c, \mathbf{p}; D) > \Theta$ for a hypothesis $(c, \mathbf{p})$.

5. Return the last evaluated object parameters $(c, \mathbf{p})$ as the result if
$L(c, \mathbf{p}; D) > \Theta$. Else conclude that none of the sought objects is in the data.
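
Steps 3 to 5 can be sketched compactly once the hypotheses, their weights, and the feature log-probabilities are available. The code below is an illustration and not the authors' implementation: the function names are hypothetical, hypotheses are assumed to be given as (TP, class, pose) tuples with hashable poses, the likelihood is passed in as a callable, and two hypotheses count as duplicates only if class and pose are exactly equal, whereas the text leaves the notion of equality open.

```python
import math

def truncated_probability(weight, feature_scores):
    """TP of one hypothesis drawn by one triple: ln(weight) + sum of the three
    feature log-probabilities, i.e. a single term under the max in Eq. (22)."""
    return math.log(weight) + sum(feature_scores)

def probabilistic_search(hypotheses, likelihood, theta):
    """Evaluate hypotheses in order of decreasing TP until one is accepted.

    hypotheses : iterable of (tp, c, p) tuples; p must be hashable.
    likelihood : callable (c, p) -> log-likelihood L(c, p; D).
    theta      : acceptance threshold for the likelihood.
    Returns the accepted (c, p), or None if no hypothesis exceeds theta.
    """
    seen = set()
    for tp, c, p in sorted(hypotheses, key=lambda h: h[0], reverse=True):
        if (c, p) in seen:
            # A duplicate was already evaluated with a higher TP; skipping it
            # realizes the max over feature triples.
            continue
        seen.add((c, p))
        if likelihood(c, p) > theta:
            return c, p
    return None
```

The expensive likelihood is evaluated at most once per distinct hypothesis, and the loop ends at the first hypothesis that passes the threshold.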

Some comments are in order. 

– Comment on step 1: The restriction to the most probable feature values by
introduction of the threshold $E$ is optional. It is a powerful means, however,
to radically speed up the algorithm if there are many range data points
$D$. If the probability measure on the features is reliable and $E$ is adjusted
conservatively, the risk of missing a good set of object parameters is very
small; see Sect. 4.1 below.

– Comment on step 2: Feature triples $G_3 \notin \mathcal{C}$, that is, inconsistent triples, are
discarded by the hashing process.

– Comment on step 3: Computation of TP is cheap as the hashed weights $\gamma_i$
from step 2 and the features' log-probabilities from step 1 can be used; cf.
(22).

– Comment on step 4: Duplicate hypotheses may be very unlikely, but this
depends on when two hypotheses are considered equal. In any case, skipping
them realizes the $\max_{G_3 \in \mathcal{C}}$ operation in the TP (22), ensuring that each
hypothesis is drawn only by the feature triple which makes it most probable.

– Comment on step 5: If the likelihood $L$ does not reach the threshold $\Theta$
but we know a priori that at least one of the objects must be present in
the scene, the algorithm may return the object parameters $(c, \mathbf{p})$ that have
scored highest under $L$.

A complete scene interpretation often involves more than one recognized set 
of object parameters (c, p). A simultaneous search for several objects is possible 
within the current framework, although it will be computationally very costly. 
It requires evaluation of TPs and likelihoods 

\[
\mathrm{TP}(c_1, \mathbf{p}_1, c_2, \mathbf{p}_2, \ldots; D) , \qquad L(c_1, \mathbf{p}_1, c_2, \mathbf{p}_2, \ldots; D)
\tag{24}
\]

for a combinatorially enlarged set of object parameters $(c_1, \mathbf{p}_1, c_2, \mathbf{p}_2, \ldots)$. A
much cheaper, although less reliable, alternative is sequential search, where the 
data belonging to a recognized object (c, p) are removed prior to the algorithm's 
restart. 



4 Experiments 

We have performed two types of experiment. One tests the quality of the measure
(20) of the features' posterior probability. The other probes the capabilities of
the complete algorithm for object segmentation and recognition. All experiments
were run on stereo data obtained from a three-camera device (the 'Triclops Stereo
Vision System', Point Grey Research Inc.). The stereo data were calculated
from three 120 x 160-pixel images by a standard least-sum-of-pixel-differences
algorithm for correspondence search. The data were accordingly of a rather low 
resolution. They were moreover noisy, contained a lot of artifacts and outliers, 
and even visible surfaces lacked large regions of data points, as is not uncommon 
for stereo data; see Figs. || and [|. 



4.1 Local Shape Recognition 

Only correct feature values are able to draw correct object hypotheses $(c, \mathbf{p})$,
except for accidental, improbable events. For an acceptable search length it is,
therefore, crucial that correct feature values yield a high log-probability $\Phi$; cf.
(20) and (22). This is even more crucial if feature candidates are preselected by
imposing a threshold $\Phi(s, \mathbf{f}; D) > E$; cf. step 1 of the algorithm in Sect. 3.3.



As point features, we chose convex and concave rectangular corners. The
training set contained 221 feature locations that produced 466,485 $\Delta$-samples
for the densities $p(\Delta \mid s)$, $s \in \{$"convex corner", "concave corner"$\}$. For the
non-feature class $s = 0$ we obtained 9,623,073 $\Delta$-samples from several thousand
non-feature locations (mostly on planes, also some edges). The parameters for
$\Delta$ sampling were $R = 15$ mm and $\epsilon = 1$ mm; cf. Sect. 3.2. The $\Delta$-samples were
counted in 15 x 20 x 10 bins, which corresponds to the discretization used for
the $\Delta$-space $[0, 1] \times [-1, 1] \times [0, 1]$.

Feature-recognition performance was evaluated on 10 scenes that contained
(i) an object with planar surface segments, edges, saddle points, and altogether
65 convex and concave corner points (feature locations); (ii) a piece of loosely
folded cloth that contributed a lot of potential distractor shapes. The sample
length was $l = 50$ $\Delta$-samples; cf. Sect. 3.2. Since drawing $\Delta$-samples is a stochastic
process, we let the program run 10 times on each scene with different seeds for
the random-number generator to obtain more significant statistics.

Corner points are particularly interesting as test features, since they allow for
comparison of the proposed log-probability measure (20) with the classic method
of curvature estimation. Let $c_1(\mathbf{f}; D)$ and $c_2(\mathbf{f}; D)$ be the principal curvatures of a
second-order polynomial that is least-squares fitted to the data points from $D$ in
the $R$-sphere around the point $\mathbf{f}$. Because the corners are the maximally curved
parabolic shapes in the scenes, it makes sense to use the curvature measures

\[
C(s = \text{"convex corner"}, \mathbf{f}; D) := \min[c_1(\mathbf{f}; D), c_2(\mathbf{f}; D)] ,
\tag{25}
\]

\[
C(s = \text{"concave corner"}, \mathbf{f}; D) := \min[-c_1(\mathbf{f}; D), -c_2(\mathbf{f}; D)] ,
\tag{26}
\]

as alternative measures of "cornerness" of the point $\mathbf{f}$ for convex and concave
shapes, respectively. We thus compare $\Phi(s, \mathbf{f}; D)$ as defined in (20) to $C(s, \mathbf{f}; D)$
on true and false feature values

\[
(s, \mathbf{f}) \in \{\text{"convex corner"}, \text{"concave corner"}\} \times D .
\tag{27}
\]

In Fig. 2 we show the distribution of $\Phi$- and $C$-scores for the 65 true
feature values in relation to their distribution for all possible feature
values, which are more than 100,000. As can be seen, the $\Phi$- and
$C$-distributions for the population of true values are both distinct from,
but overlapping with, the $\Phi$- and $C$-distributions for the entire
population of feature values. A qualitative difference between $\Phi$- and
$C$-scoring of feature values is more obvious in the ranking they produce. In
Fig. 3 we show the distribution of $\Phi$- and $C$-ranks of the true feature
values among all feature values, with 1 being the first rank, i.e., highest
$\Phi$/$C$-score, and 0 the last. Both $\Phi$- and $C$-ranks of true values
are most frequently found close to 1 and gradually less frequently at lower
ranks. There are almost no true features $\Phi$-ranked below 0.6. The
$C$-ranks, in contrast, seem to scatter uniformly in [0, 1] for part of the
true features. These features score under $C$ as poorly as the non-features.
If the $C$-score were used, they would in practice not be available for
object recognition, as they would only be selected to draw an object
hypothesis after hundreds to thousands of false feature values.
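
The rank scale used in this comparison, with the highest score at rank 1 and
the lowest at rank 0, can be sketched as follows. The linear mapping and the
treatment of ties (equal scores share the better rank) are assumptions made
here, since the text does not state them; `normalized_ranks` is a
hypothetical name.

```python
def normalized_ranks(scores):
    """Return one rank in [0, 1] per score: 1 for the highest score,
    0 for the lowest; tied scores share the better rank (assumption)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to rank")
    ordered = sorted(scores, reverse=True)
    # Position of the first occurrence = number of strictly higher scores.
    higher = {}
    for position, score in enumerate(ordered):
        higher.setdefault(score, position)
    return [1.0 - higher[s] / (n - 1) for s in scores]
```

Under this convention, a true feature value with rank r among N candidates
is preceded by about (1 - r)(N - 1) higher-scoring values, which is why low
ranks delay the corresponding object hypothesis.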



Fig. 2. Score distribution of true (transparent bars, thick lines) and all possible
(filled bars, thin lines) feature values as produced by the $\Phi$-score (left)
and the $C$-score (right). In each plot, the horizontal axis shows the score
and the vertical axis the normalized frequency of feature values.





Fig. 3. Rank distribution of true feature values among all possible feature values
as produced by the $\Phi$-score (left) and the $C$-score (right); rank 1 is the
highest score in each data set, rank 0 the lowest score. In each plot, the
horizontal axis shows the rank and the vertical axis the normalized frequency
of true feature values.



The result of the comparison does not come as a complete surprise. Some kind of
surface-fitting procedure is necessary for estimation of the principal curvatures
[|| fL|. Unless a lot of care is invested, fitting to data will always be very
sensitive to outliers and artifacts, of which there are many in typical stereo
data. A robust fitting procedure, however, has to incorporate in some way a
statistical model of the data that accounts for its outliers and artifacts. We here
propose to go all the way to a statistical model, the point-relation densities, and
to do without any fitting for local shape recognition.

4.2 Object Segmentation and Recognition 

We have tested segmentation and recognition performance of the proposed
algorithm on two objects. One is a cube from which 8 smaller cubes are removed
at its corners; the other is a subset of this object. These two objects can be 
assembled to create challenging segmentation and recognition scenarios. 

The test data were taken from 50 scenes that contained the two objects to 
be recognized, and additional distractor objects in some of them. A single data 
set consisted of roughly 5,000 to 10,000 3D points. As an acceptable accuracy 
for recognition of an object's pose we considered what is needed for grasping 
with a hand (robot or human), i.e., less than 3 degrees of angular and less than 
5 mm of translational misalignment. For a fixed set of parameters of the search 
algorithm, in all but 3 cases the result was acceptable in this sense. Processing 
times ranged roughly between 1 and 50 seconds, staying mostly well below 5 
seconds, on a Sun UltraSPARC-III workstation (750 MHz).
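
The acceptance criterion for a recognized pose (less than 3 degrees of
angular and less than 5 mm of translational misalignment) can be written as a
small check. How the misalignment was measured is not stated in the text, so
this sketch assumes the angle of the relative rotation and the Euclidean
distance between translation vectors in millimeters. It also ignores object
symmetries, and all function names are hypothetical.

```python
import math


def rotation_angle_deg(r_est, r_ref):
    """Angle of the relative rotation between two 3x3 rotation matrices."""
    # trace(r_est^T r_ref) equals the sum of element-wise products.
    trace = sum(r_est[i][j] * r_ref[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def pose_is_acceptable(r_est, t_est, r_ref, t_ref,
                       max_angle_deg=3.0, max_shift_mm=5.0):
    """True if both misalignments stay below the stated thresholds."""
    shift = math.dist(t_est, t_ref)
    return rotation_angle_deg(r_est, r_ref) < max_angle_deg and shift < max_shift_mm


def rot_z(deg):
    """Rotation about the z axis, used here only to build example poses."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


identity = rot_z(0.0)
# 2 degrees and 3 mm off: acceptable; 4 degrees off: not acceptable.
print(pose_is_acceptable(rot_z(2.0), (3.0, 0.0, 0.0), identity, (0.0, 0.0, 0.0)))
print(pose_is_acceptable(rot_z(4.0), (3.0, 0.0, 0.0), identity, (0.0, 0.0, 0.0)))
```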

In Figs. 4 and 5 we show examples of scenes, camera images, range data, and
recognition results. Edges drawn in the data outline the object pose recognized 
first. The other object can be recognized after deletion of data points belonging 
to the first object. 



5 Conclusion 

We have derived from a probabilistic perspective a new variant of the
hypothesize-and-test algorithm for object recognition. A critical element of
any such algorithm is the ordered sequence of hypotheses that is tested. The
ordered sequence
that is generated by our algorithm is inferred from a statistical criterion, here 
called the truncated probability (TP) of object parameters. As a key component 
of the truncated probability, we have introduced point-relation densities, from 
which we obtain an estimate of posterior probability for surface shape given 
range data. 

One of the strengths of the algorithm, as demonstrated in experiments, is its 
very high degree of robustness to noise and artifacts in the data. Raw stereo-data 
points of low quality are sufficient for many recognition tasks. Such data are fast 
and cheap to obtain and are intrinsically invariant to changes in illumination. 



We have argued in Sect. [^l] that feature groups of size $g = 3$, i.e., minimal
groups, are a good choice for guiding the search for model matches. However,
if we can be sure that more than three true object features are represented in
the data, it may be worth considering larger groups. For $g > 3$, the density
(noft that enters the TP rt|) looks more complicated. In particular, hash tables of
higher dimension are needed to accommodate the grouping laws, i.e., the weights
j(G g ). If we do without the grouping laws, using larger groups is equivalent to
having more than just the largest term of the density (^) contributing to the
TP; cf. (Q). In the limit of large feature groups, the approach then resembles
generalized-Hough-transform and pose-clustering techniques [Q, 17, 16].

Fig. 4. Example scene with the two objects we used for evaluation of
segmentation and recognition performance. The smaller object is stacked on the
larger as drawn near the center of the figure. Shown are one of the camera
images and three orthogonal views of the stereo-data set. The object pose
recognized first is outlined in the data. The length of both objects' longest
edges is 13 cm. The recognition time was 1.2 seconds.

Fig. 5. Example scene with the two objects we used for evaluation of
segmentation and recognition performance; cf. Fig. 4. The recognition time was
1.5 seconds.

One avenue of research suggested here is the generalization of point-relation
densities. Alternative representations of range data have to be explored
for learning posterior-probability estimates for various surface shapes. 